(54) Title: METHOD AND APPARATUS FOR CUSTOM PROCESSOR OPERATIONS



(57) Abstract 

Custom operations are usable in processor systems for performing
functions including multimedia functions. These custom operations
enhance a system, such as a PC system (figure 1), to provide
real-time multimedia capabilities while maintaining the advantages
of a special-purpose, embedded solution, i.e., low cost and chip
count, and the advantages of general-purpose reprogrammability.
These custom operations work in a computer system (figure 1) which
supplies input data having operand data (figure 8, rsrc1 and rsrc2),
performs operations on the operand data, and supplies result data to
a destination register (figure 8, rdest). Operations performed may
include audio and video processing including clipping or saturation
operations. The present invention also performs parallel operations
on select operand data from input registers (figure 8, rsrc1 and
rsrc2) and stores results in the destination register (figure 8,
rdest).

METHOD AND APPARATUS FOR CUSTOM PROCESSOR OPERATIONS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. 
Provisional Application 60/003,140 filed September 1, 1995, 
and U.S. Provisional Application No. 60/004,642 filed 
September 25, 1995. 

The following applications are incorporated by 
reference herein for discussion of VLIW processing systems: 

US Patent No. 5,103,311: DATA PROCESSING MODULE AND 
VIDEO PROCESSING SYSTEM INCORPORATING SAME; 

US Patent No. 5,450,556: VLIW PROCESSOR WHICH USES 
PATH INFORMATION GENERATED BY A BRANCH CONTROL UNIT TO 
INHIBIT OPERATIONS WHICH ARE NOT ON A CORRECT PATH; 

US Patent No. 5,313,551: MULTIPORT MEMORY BYPASS 
UNDER SOFTWARE CONTROL; 

US Application Serial No. 07/998,080 filed December 
29, 1992 entitled VLIW PROCESSOR WITH LESS INSTRUCTION ISSUE 
SLOTS THAN FUNCTIONAL UNITS;

US Serial No. 07/594,534 filed October 5, 1990

2. Description of the Related Art

A system may include a general-purpose CPU and
additional units to serve as a multi-function PC enhancement
vehicle. Typically, a PC must deal with multi-standard
video and audio streams, and users desire both decompression
and compression, if possible. While the CPU chips used in
PCs are becoming capable of low-resolution real-time video
decompression, high-quality video decompression and
compression are still not possible. Further, users demand
that their systems provide live video and audio without
sacrificing responsiveness of the system.

For both general-purpose and embedded
microprocessor-based applications, programming in a high-level
language is desirable. To effectively support
optimizing compilers and a simple programming model, certain
microprocessor architecture features are needed, such as a
large, linear address space, general-purpose registers, and
register-to-register operations that directly support
manipulation of linear address pointers. A recently common
choice in microprocessor architectures is 32-bit linear
addresses, 32-bit registers, and 32-bit integer operations



WO 97/09679 



PCT/US96/14155 



of m-bit execution hardware in the implementation. 

Logic of conventional DSP (digital signal processing)
operations calculates modulo values. Clipping
or saturation operations of the present invention are
especially valuable in signal processing applications where
the processing generates data that may run beyond the physical
limits of the registers. Conventionally, when this occurs,
data are mapped to the other end of the physically available
range. In processing of signals, this cyclical mapping can
be disastrous. For example, a very low audio volume would
be mapped onto the highest using the conventional scheme.
In control applications and in video/audio applications,
modulo values are not desirable when the control range or
intensity range saturates.
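
The difference between modulo and clipped results can be illustrated with a short sketch. The Python helpers below are hypothetical and not part of the described hardware; they model a signed 16-bit register, a width chosen here only for illustration.

```python
def wrap16(value: int) -> int:
    """Keep the low 16 bits and reinterpret them as signed (modulo behaviour)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def clip16(value: int) -> int:
    """Saturate to the signed 16-bit range [-32768, 32767] instead of wrapping."""
    return max(-0x8000, min(0x7FFF, value))


# Two loud samples whose sum (35000) exceeds the 16-bit maximum of 32767.
total = 30000 + 5000
wrapped = wrap16(total)   # lands at the other end of the range (negative)
clipped = clip16(total)   # stays at the positive limit
```

With modulo arithmetic the sum flips sign, which is the cyclical mapping the text calls disastrous; the clipped result stays at the upper limit.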



SUMMARY OF THE PRESENT INVENTION

The present invention enhances a system, such as a
PC system, to provide real-time multimedia capabilities
while maintaining advantages of a special-purpose, embedded
solution, i.e., low cost and chip count, and advantages of a
general-purpose processor, i.e., reprogrammability. For PC

Another object of the present invention is to use
multiple operand registers storing multiple operands for 
parallel processing under control of a single instruction. 
This is particularly advantageous in audio and/or video 
applications where samples are currently 8 or 16 bits. 

An object of the present invention is to use a
clipping operation to keep received signals, such as audio
or video signals, on the correct side of a truncated range.

The present invention can be used in systems ranging from
low-cost, single-purpose systems such as video phones to
reprogrammable, multi-purpose plug-in cards for traditional
personal computers. Additionally, the present invention may
be used in a system which easily implements popular
multimedia standards such as MPEG-1 and MPEG-2. Moreover,
orientation of the present invention around a powerful
general-purpose CPU makes it capable of implementing a
variety of multimedia algorithms, whether open or
proprietary.

Defining software compatibility at a source-code
level has an advantage of providing freedom to strike an
optimum balance between cost and performance. Powerful

contemplated of carrying out the invention. As will be
realized, the invention is capable of other and different 
embodiments, and its several details are capable of 
modifications in various obvious respects, all without 
departing from the invention. Accordingly, the drawings and 
description are to be regarded as illustrative in nature, 
and not as restrictive. 

BRIEF DESCRIPTION OF THE DRAWINGS
These objects as well as other objects of the
present invention will be apparent from the description of
the present invention with the aid of the following
drawings:

Figure 1 is a block diagram of an example system 
for use with the present invention; 

Figure 2 illustrates an example of CPU register 
architecture; 

Figure 3(a) illustrates an example of an
organization of a matrix in memory; 

Figure 3(b) illustrates a task to be performed in 
the example; 

Figure 22 illustrates a mergemsb operation;

Figure 23 illustrates a pack16lsb operation;

Figure 24 illustrates a pack16msb operation;

Figure 25 illustrates a packbytes operation;

Figure 26 illustrates a quadavg operation;

Figure 27 illustrates a quadumulmsb operation;

Figure 28 illustrates an ume8ii operation;

Figure 29 illustrates an ume8uu operation;

Figure 30 illustrates an iclipi operation;

Figure 31 illustrates an uclipi operation; and

Figure 32 illustrates an uclipu operation.



DESCRIPTION OF PREFERRED EMBODIMENTS 
Figure 1 shows a block diagram of an example system
for use with the present invention. This system includes a
microprocessor, a block of synchronous dynamic RAM (SDRAM),
and external circuitry needed to interface to incoming
and/or outgoing multimedia data streams.

In this example, a 32-bit CPU forms a VLIW
processor core. The CPU implements a 32-bit linear address
space and 128 fully general-purpose 32-bit registers. In

"multimedia" operations.

Figure 2 illustrates one example of a CPU register
architecture. The CPU of the present embodiment has 128
fully general-purpose 32-bit registers, labeled r0..r127.
In this embodiment, registers r0 and r1 are used for special
purposes and registers r2 through r127 are true general
purpose registers.

In the present system, the processor issues one 
long instruction every clock cycle. Each such instruction 
includes several operations (5 operations for the present
embodiment). Each operation is comparable to a RISC machine
instruction, except that execution of an operation is 
conditional upon the content of a general purpose register. 

Data in the register may be in, for example, 
integer representation or floating point representation. 

Integers may be considered, in the present 
embodiment, as 'unsigned integers' or 'signed integers', as
binary and two's complement bit patterns, respectively. 
Arithmetic on integers does not generate traps. If a result 
is not representable, the bit pattern returned is operation 
specific, as defined in the individual operation description 

The 'if r23' clause evaluates TRUE or FALSE depending on the
LSB of the value in r23. Hence, depending on the LSB of
r23, r13 is either unchanged or set to contain an integer
sum of r14 and r10. For example, in this embodiment of the
present invention, if the LSB is evaluated as 1, a
destination register (rdest), in this example r13, is
written. Guarding controls effects on programmer-visible
states of the system, i.e., register values, memory content
and device state.
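
Guarded execution as described above can be sketched as follows. This is a minimal illustrative model, not the hardware implementation: the function name `guarded_iadd`, the dictionary used as a register file, and the 32-bit wrap applied to the plain integer sum are all assumptions made for the example.

```python
def guarded_iadd(regs: dict, guard: str, src1: str, src2: str, dest: str) -> None:
    """Model of a guarded integer add on a register file held in a dict."""
    if regs[guard] & 1:
        # Only the least-significant bit of the guard register is examined.
        regs[dest] = (regs[src1] + regs[src2]) & 0xFFFFFFFF
    # When the guard LSB is 0, the destination register is left unchanged.


regs = {"r10": 7, "r13": 99, "r14": 5, "r23": 2}
guarded_iadd(regs, "r23", "r14", "r10", "r13")   # LSB of r23 is 0: r13 stays 99
regs["r23"] = 3
guarded_iadd(regs, "r23", "r14", "r10", "r13")   # LSB of r23 is 1: r13 becomes 12
```

Because the guard replaces a conditional branch, both outcomes occupy the same operation slot, which is what lets a scheduler keep the slots of a long instruction filled.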

Memory in the present invention is byte
addressable. Loads and stores are 'naturally aligned', i.e.,
a 16-bit load or store targets an address that is a multiple
of 2. A 32-bit load or store targets an address that is a
multiple of 4. One skilled in the art could easily modify
this.

Compute operations are register-to-register
operations. A specified operation is performed on one or
two source registers and a result is written to a
destination register (rdest).

Custom operations are special compute operations 
and are like normal compute operations; however, these 

entire resources, such as, for example, 32-bit resources, to
operate on two sixteen-bit data items or four eight-bit data
items simultaneously. This use improves performance by a 
significant factor with only a tiny increase in 
implementation cost. Additionally, this use achieves a high 
execution rate from standard microprocessor resources. 

Some high-function custom operations eliminate
conditional branches, which helps a scheduler effectively
use five operation slots in each instruction of the present
system, for example, the Philips TM-1 chip with TM-1
instructions. Filling up all five slots is especially
important in inner loops of computationally intensive 
multimedia applications. Custom operations help the present 
invention achieve extremely high multimedia performance at 
the lowest possible cost. 

Table 1 is a listing of custom operations of the 
present invention. Some custom operations exist in several 
versions that differ in treatment of respective operands and 
results. Mnemonics for these different versions attempt to 
clarify the respective treatment to aid in selection of the 
appropriate operation, although clearly, different mnemonics 

Table 1. Custom operations listed by function type

Function            Custom Op       Description
------------------  --------------  -------------------------------------------------------
DSP absolute value  dspiabs         Clipped signed 32-bit absolute value
                    dspidualabs     Dual clipped absolute values of signed 16-bit halfwords
DSP add             dspiadd         Clipped signed 32-bit add
                    dspuadd         Clipped unsigned 32-bit add
                    dspidualadd     Dual clipped add of signed 16-bit halfwords
                    dspuquadaddui   Quad clipped add of unsigned/signed bytes
DSP multiply        dspimul         Clipped signed 32-bit multiply
                    dspumul         Clipped unsigned 32-bit multiply
                    dspidualmul     Dual clipped multiply of signed 16-bit halfwords
DSP subtract        dspisub         Clipped signed 32-bit subtract
                    dspusub         Clipped unsigned 32-bit subtract
                    dspidualsub     Dual clipped subtract of signed 16-bit halfwords
Sum of products     ifir16          Signed sum of products of signed 16-bit halfwords
                    ifir8ii         Signed sum of products of signed bytes
                    ifir8ui         Signed sum of products of unsigned/signed bytes
                    ufir16          Unsigned sum of products of unsigned 16-bit halfwords
                    ufir8uu         Unsigned sum of products of unsigned bytes
Merge               mergelsb        Merge least-significant bytes
                    mergemsb        Merge most-significant bytes

example, contain eight-bit pixel values. Figure 3(a)
illustrates the organization of the matrix in memory, and
Figure 3(b) illustrates, in standard mathematical notation,
the task to be performed.

Performing this operation with traditional
microprocessor instructions is straightforward but time
consuming. One method to perform the manipulation is to
perform 12 load-byte instructions to load bytes (since only
12 of the 16 bytes need to be repositioned) and 12 store-byte
instructions to store the bytes back in memory in their
new positions. Another method would be to perform four
load-word instructions, reposition bytes of the loaded words
in registers, and then perform four store-word instructions.
Unfortunately, repositioning the bytes in registers requires
a large number of instructions to properly shift and mask
the bytes. Performing twenty-four loads and stores makes
implicit use of shifting and masking hardware in load/store
units and thus yields a shorter instruction sequence.

The problem with performing twenty four loads and 
stores is that loads and stores are inherently slow 
operations: they must access at least cache and possibly 




matrix into registers r10, r11, r12, and r13. A next
sequence of four merge operations (mergemsb and mergelsb)
produces intermediate results in registers r14, r15, r16,
and r17. A next sequence of four pack operations (pack16msb
and pack16lsb) may then replace the original operands or
place the transposed matrix in separate registers if the
original matrix operands were needed for further
computations (a TM-1 optimizing C compiler could perform
such an analysis automatically). In this example, the
transposed matrix is placed in separate registers (st32d),
registers r18, r19, r20, and r21. Four final store-word
operations put the transposed matrix back into memory.

Thus, using the custom operations of the present
invention, the byte-matrix transposition requires four load-word
operations and four store-word operations (the minimum
possible) and eight register-to-register data manipulation
operations. The result is 16 operations, or byte-matrix
transposition at a rate of one operation per byte. Figure
5(b) illustrates an equivalent C-language fragment.
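
The eight register-to-register steps can be sketched in Python as follows. This is a hedged model, not the fragment of Figure 5(b): the merge semantics follow the byte assignments given later in this description, while the pack16msb/pack16lsb semantics (packing the two upper or the two lower halfwords, first argument on top), the pairing order of the operations, and the convention that each row's first byte sits in the most-significant byte of its word are assumptions made for the example.

```python
def mergemsb(s1: int, s2: int) -> int:
    """Interleave the two most-significant bytes of s1 and s2 (s1 byte first)."""
    return ((s1 & 0xFF000000) | ((s2 >> 8) & 0x00FF0000)
            | ((s1 >> 8) & 0x0000FF00) | ((s2 >> 16) & 0x000000FF))


def mergelsb(s1: int, s2: int) -> int:
    """Interleave the two least-significant bytes of s1 and s2 (s1 byte first)."""
    return (((s1 << 16) & 0xFF000000) | ((s2 << 8) & 0x00FF0000)
            | ((s1 << 8) & 0x0000FF00) | (s2 & 0x000000FF))


def pack16msb(s1: int, s2: int) -> int:
    """Assumed semantics: upper halfword of s1 on top, upper halfword of s2 below."""
    return (s1 & 0xFFFF0000) | ((s2 >> 16) & 0xFFFF)


def pack16lsb(s1: int, s2: int) -> int:
    """Assumed semantics: lower halfword of s1 on top, lower halfword of s2 below."""
    return ((s1 & 0xFFFF) << 16) | (s2 & 0xFFFF)


def transpose4x4(r10: int, r11: int, r12: int, r13: int):
    """Transpose a 4x4 byte matrix held as four 32-bit row words."""
    r14, r15 = mergemsb(r10, r11), mergemsb(r12, r13)    # four merge operations
    r16, r17 = mergelsb(r10, r11), mergelsb(r12, r13)
    r18, r19 = pack16msb(r14, r15), pack16lsb(r14, r15)  # four pack operations
    r20, r21 = pack16msb(r16, r17), pack16lsb(r16, r17)
    return r18, r19, r20, r21


cols = transpose4x4(0x00010203, 0x10111213, 0x20212223, 0x30313233)
```

Each returned word holds one column of the original matrix, so no shift-and-mask sequence and no byte-wide load or store is needed.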

While the advantage of the custom-operation-based
algorithm over brute-force code that uses 24 load- and store-

hardware.

As this example has shown, use of the custom
operations of the present invention may reduce the absolute
number of operations needed to perform a computation and can
also help a compiling system produce code that fully
exploits the performance potential of the respective CPU.
Other applications, such as MPEG image reconstruction for
a complete MPEG video decoding algorithm and
motion-estimation kernels, could benefit from use of the
custom operations of the present invention, although this
list is not exhaustive.

The present invention includes those custom
operations listed in Table 1. The specifics of each of
these custom operations are set forth below. In the
function code given below, standard symbols, syntax, etc.
are used. For example, temp1 and temp2 represent temporary
registers. Further, as an example, a function temp1 ←
sign_ext16to32(rsrc1<15:0>) means that temp1 is loaded with
the 15:0 bits (bits 0 to 15) of the rsrc1 register with the
sign bit (in this example, bit 15) being extended to
bits 16 to 31 (sign bit extension). Similarly, temp2 ←

if rguard then {
    if rsrc1 >= 0 then
        rdest ← rsrc1
    else if rsrc1 = 0x80000000 then
        rdest ← 0x7fffffff
    else
        rdest ← -rsrc1
}

The dspiabs operation is a pseudo operation
transformed by the scheduler into an h_dspiabs with a
constant first argument zero and second argument equal to
the dspiabs argument. Pseudo operations generally are not
used in assembly source files. h_dspiabs performs the same
function; however, this operation requires a zero as first
argument.

The dspiabs operation computes the absolute value
of rsrc1, clips the result into a range [2^31-1...0] or
[0x7fffffff...0], and stores the clipped value into rdest (a
destination register). All values are signed integers.
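
The one case where clipping matters for dspiabs is the most negative input, whose true absolute value does not fit in 32 signed bits. A minimal Python sketch, assuming the operand is passed as an ordinary signed integer within the 32-bit range (the function is a model for illustration, not the hardware definition):

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def dspiabs(rsrc1: int) -> int:
    """Clipped absolute value of a signed 32-bit integer."""
    if rsrc1 >= 0:
        return rsrc1
    if rsrc1 == INT32_MIN:
        # |-2^31| = 2^31 is not representable, so the result is clipped.
        return INT32_MAX
    return -rsrc1
```

A plain two's complement negation would return the most negative value unchanged, which is the modulo behaviour the clipped operation avoids.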

dspidualabs: dspidualabs is a dual clipped absolute value
of signed 16-bit halfwords operation, pseudo-op for
h_dspidualabs (hardware dspidualabs). This operation has
the following function:

if rguard then {
    temp1 ← sign_ext16to32(rsrc1<15:0>)

        rdest ← 0x80000000
    else if temp > 0x000000007fffffff then
        rdest ← 0x7fffffff
    else
        rdest ← temp
}

As shown in Figure 6, the dspiadd operation
computes a signed sum rsrc1 + rsrc2, clips the result into a
32-bit signed range [2^31-1...-2^31] or
[0x7fffffff...0x80000000], and stores the clipped value into
rdest. All values are signed integers.
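
The clipped add can be modelled in a few lines. In this sketch, Python's unbounded integers stand in for the wider intermediate value, and operands are assumed to be ordinary signed integers already within the 32-bit range; the function is an illustration rather than the hardware definition.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def dspiadd(rsrc1: int, rsrc2: int) -> int:
    """Signed 32-bit add whose result saturates instead of wrapping."""
    temp = rsrc1 + rsrc2                    # exact sum, no overflow in Python
    return max(INT32_MIN, min(INT32_MAX, temp))
```

For example, adding 1 to the largest positive value leaves the result at 0x7fffffff rather than wrapping to the most negative value.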

dspuadd: dspuadd is a clipped unsigned add operation. This
operation has the following function:

if rguard then {
    temp ← zero_ext32to64(rsrc1) + zero_ext32to64(rsrc2)
    if (unsigned) temp > 0x00000000ffffffff then
        rdest ← 0xffffffff
    else
        rdest ← temp<31:0>
}

As shown in Figure 7, the dspuadd operation computes
an unsigned sum rsrc1 + rsrc2, clips the result into an
unsigned range [2^32-1...0] or [0xffffffff...0], and stores
the clipped value into rdest.

dspidualadd: dspidualadd is a dual clipped add of signed
16-bit halfwords operation. This operation has the

        else rdest<m:n> ← temp<7:0>
}

As shown in Figure 9, the dspuquadaddui operation
computes four separate sums of four respective pairs of
corresponding 8-bit bytes of rsrc1 and rsrc2. Bytes in
rsrc1 are considered unsigned values; bytes in rsrc2 are
considered signed values. The four sums are clipped into an
unsigned range [255...0] or [0xff...0]; thus, resulting byte
sums are unsigned. All computations are performed without
loss of precision.
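
The four-lane behaviour described above can be sketched as a loop over byte positions. This Python model is illustrative only; the helper `sign8` is a hypothetical name, and the operands are assumed to be 32-bit patterns passed as non-negative integers.

```python
def sign8(byte: int) -> int:
    """Interpret an 8-bit pattern as a signed value in [-128, 127]."""
    return byte - 0x100 if byte & 0x80 else byte


def dspuquadaddui(rsrc1: int, rsrc2: int) -> int:
    """Four clipped byte sums: unsigned bytes of rsrc1 plus signed bytes of rsrc2."""
    rdest = 0
    for shift in (0, 8, 16, 24):
        unsigned_byte = (rsrc1 >> shift) & 0xFF
        signed_byte = sign8((rsrc2 >> shift) & 0xFF)
        total = unsigned_byte + signed_byte        # exact, range [-128, 382]
        rdest |= max(0, min(0xFF, total)) << shift  # clip to [0, 255]
    return rdest
```

A typical use would be adding a signed correction term to four unsigned pixel values at once, with each pixel pinned to 0 or 255 rather than wrapping.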

dspimul: dspimul is a clipped signed multiply operation.
This operation has the following function:

if rguard then {
    temp ← sign_ext32to64(rsrc1) × sign_ext32to64(rsrc2)
    if temp < 0xffffffff80000000 then
        rdest ← 0x80000000
    else if temp > 0x000000007fffffff then
        rdest ← 0x7fffffff
    else
        rdest ← temp<31:0>
}

As shown in Figure 10, the dspimul operation
computes a product rsrc1 × rsrc2, clips the results into a
range [2^31-1...-2^31] or [0x7fffffff...0x80000000], and
stores the clipped value into rdest. All values are signed

    rdest<15:0> ← temp1<15:0>
}

As shown in Figure 12, the dspidualmul operation
computes two 16-bit clipped, signed products separately on
two respective pairs of high and low 16-bit halfwords of
rsrc1 and rsrc2. Both products are clipped into a range
[2^15-1...-2^15] or [0x7fff...0x8000] and written into
corresponding halfwords of rdest. All values are signed
16-bit integers.
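
A minimal Python model of the dual clipped multiply follows. It is a sketch built from the description above, with hypothetical helper names (`sign16`, `clip16`) and operands assumed to be 32-bit patterns passed as non-negative integers.

```python
def sign16(halfword: int) -> int:
    """Interpret a 16-bit pattern as a signed value in [-32768, 32767]."""
    return halfword - 0x10000 if halfword & 0x8000 else halfword


def clip16(value: int) -> int:
    """Saturate to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, value))


def dspidualmul(rsrc1: int, rsrc2: int) -> int:
    """Two independent clipped 16-bit signed products packed into one word."""
    high = clip16(sign16((rsrc1 >> 16) & 0xFFFF) * sign16((rsrc2 >> 16) & 0xFFFF))
    low = clip16(sign16(rsrc1 & 0xFFFF) * sign16(rsrc2 & 0xFFFF))
    # Re-encode each signed result as a 16-bit pattern before packing.
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)
```

The high and low halves never interact: a product that saturates in one halfword has no effect on the other.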

dspisub: dspisub is a clipped signed subtract operation.
This operation has the following function:

if rguard then {
    temp ← sign_ext32to64(rsrc1) - sign_ext32to64(rsrc2)
    if temp < 0xffffffff80000000 then
        rdest ← 0x80000000
    else if temp > 0x000000007fffffff then
        rdest ← 0x7fffffff
    else
        rdest ← temp<31:0>
}

As shown in Figure 13, the dspisub operation
computes a difference rsrc1 - rsrc2, clips the result into a
range [0x80000000...0x7fffffff], and stores the clipped value
into rdest. All values are signed integers.

computes two 16-bit clipped, signed differences separately
on two respective pairs of high and low 16-bit halfwords of
rsrc1 and rsrc2. Both differences are clipped into a range
[2^15-1...-2^15] or [0x7fff...0x8000] and written into
corresponding halfwords of rdest. All values are signed
16-bit integers.

ifir16: ifir16 is a sum of products of signed 16-bit
halfwords operation. This operation has the following
function:

if rguard then
    rdest ← sign_ext16to32(rsrc1<31:16>) ×
            sign_ext16to32(rsrc2<31:16>) +
            sign_ext16to32(rsrc1<15:0>) ×
            sign_ext16to32(rsrc2<15:0>)

As shown in Figure 16, the ifir16 operation
computes two separate products of two respective pairs of
corresponding 16-bit halfwords of rsrc1 and rsrc2; the two
products are summed, and the result is written to rdest.
All halfwords are considered signed; thus, the products and
the final sum of products are signed. All computations are
performed without loss of precision.
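
The two-term sum of products can be modelled directly from the function above. In this Python sketch the result is returned as an ordinary signed integer; how the hardware fits the single extreme case (both products equal to 2^30) into a 32-bit rdest is not modelled. The helper name `sign16` is hypothetical.

```python
def sign16(halfword: int) -> int:
    """Interpret a 16-bit pattern as a signed value in [-32768, 32767]."""
    return halfword - 0x10000 if halfword & 0x8000 else halfword


def ifir16(rsrc1: int, rsrc2: int) -> int:
    """Signed sum of the two products of corresponding 16-bit halfwords."""
    high = sign16((rsrc1 >> 16) & 0xFFFF) * sign16((rsrc2 >> 16) & 0xFFFF)
    low = sign16(rsrc1 & 0xFFFF) * sign16(rsrc2 & 0xFFFF)
    return high + low
```

With two samples packed in rsrc1 and two coefficients in rsrc2, one such operation evaluates two taps of a FIR filter, which is presumably the origin of the mnemonic.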

ifir8ii: ifir8ii is a signed sum of products of signed
bytes operation. This operation has the following function:

computes four separate products of four respective pairs of
corresponding 8-bit bytes of rsrc1 and rsrc2; the four
products are summed, and the result is written to rdest.
Bytes from rsrc1 are considered unsigned, but bytes from
rsrc2 are considered signed; thus, the products and the
final sum of products are signed. All computations are
performed without loss of precision.
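
The mixed-sign byte variant just described, which matches the Table 1 entry for ifir8ui, can be sketched as follows. The function body is a model derived from the prose description, since the corresponding function code is not reproduced here; `sign8` is a hypothetical helper.

```python
def sign8(byte: int) -> int:
    """Interpret an 8-bit pattern as a signed value in [-128, 127]."""
    return byte - 0x100 if byte & 0x80 else byte


def ifir8ui(rsrc1: int, rsrc2: int) -> int:
    """Signed sum of four products: unsigned bytes of rsrc1 times signed bytes of rsrc2."""
    total = 0
    for shift in (0, 8, 16, 24):
        unsigned_byte = (rsrc1 >> shift) & 0xFF
        signed_byte = sign8((rsrc2 >> shift) & 0xFF)
        total += unsigned_byte * signed_byte
    return total
```

The pairing of unsigned and signed operands suits the common case of unsigned pixel data multiplied by signed filter coefficients.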

ufir16: ufir16 is a sum of products of unsigned 16-bit
halfwords operation. This operation has the following
function:

if rguard then {
    rdest ← zero_ext16to32(rsrc1<31:16>) ×
            zero_ext16to32(rsrc2<31:16>) +
            zero_ext16to32(rsrc1<15:0>) ×
            zero_ext16to32(rsrc2<15:0>)
}

As shown in Figure 19, the ufir16 operation
computes two separate products of two respective pairs of
corresponding 16-bit halfwords of rsrc1 and rsrc2; the two
products are summed, and the result is written to rdest.
All halfwords are considered unsigned; thus, the products
and the final sum of products are unsigned. All
computations are performed without loss of precision. The
final sum of products is clipped into the range


interleaves two respective pairs of least-significant bytes
from arguments rsrc1 and rsrc2 into rdest. The
least-significant byte from rsrc2 is packed into the
least-significant byte of rdest; the least-significant byte from
rsrc1 is packed into the second-least-significant byte of
rdest; the second-least-significant byte from rsrc2 is
packed into the second-most-significant byte of rdest; and
the second-least-significant byte from rsrc1 is packed into
the most-significant byte of rdest.

mergemsb: mergemsb is a merge most-significant byte
operation. This operation has the following function:

if rguard then {
    rdest<7:0> ← rsrc2<23:16>
    rdest<15:8> ← rsrc1<23:16>
    rdest<23:16> ← rsrc2<31:24>
    rdest<31:24> ← rsrc1<31:24>
}

As shown in Figure 22, the mergemsb operation
interleaves the two respective pairs of most-significant
bytes from arguments rsrc1 and rsrc2 into rdest. The
second-most-significant byte from rsrc2 is packed into the
least-significant byte of rdest; the second-most-significant
byte from rsrc1 is packed into the second-least-significant
byte of rdest; the most-significant byte from rsrc2 is

arguments rsrc1 and rsrc2 into rdest. The halfword from
rsrc1 is packed into the most-significant halfword of rdest
and the halfword from rsrc2 is packed into the
least-significant halfword of rdest.

packbytes: packbytes is a pack least-significant byte
operation. This operation has the following function:

if rguard then {
    rdest<7:0> ← rsrc2<7:0>
    rdest<15:8> ← rsrc1<7:0>
}

As shown in Figure 25, the packbytes operation
packs two respective least-significant bytes from arguments
rsrc1 and rsrc2 into rdest. The byte from rsrc1 is packed
into the second-least-significant byte of rdest and the byte
from rsrc2 is packed into the least-significant byte of
rdest. The two most-significant bytes of rdest are filled
with zeros.
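
As a small worked sketch of the byte placement described above (a Python model for illustration, with operands assumed to be 32-bit patterns passed as non-negative integers):

```python
def packbytes(rsrc1: int, rsrc2: int) -> int:
    """Pack the low byte of rsrc1 above the low byte of rsrc2; upper half is zero."""
    return ((rsrc1 & 0xFF) << 8) | (rsrc2 & 0xFF)


# Low bytes 0x78 and 0xF0 end up side by side in the low halfword.
packed = packbytes(0x12345678, 0x9ABCDEF0)
```

Only bits 7:0 of each source contribute; whatever the upper 24 bits held is discarded.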

quadavg: quadavg is an unsigned byte-wise quad average
operation. This operation has the following function:

if rguard then {
    temp ← (zero_ext8to32(rsrc1<7:0>) +
            zero_ext8to32(rsrc2<7:0>) + 1) / 2
    rdest<7:0> ← temp<7:0>
    temp ← (zero_ext8to32(rsrc1<15:8>) +
            zero_ext8to32(rsrc2<15:8>) + 1) / 2

    rdest<31:24> ← temp<15:8>
}

As shown in Figure 27, the quadumulmsb operation
computes four separate products of four respective pairs of
corresponding 8-bit bytes of rsrc1 and rsrc2. All bytes are
considered unsigned. The most-significant 8 bits of each
16-bit product are written to the corresponding byte in
rdest.
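
The quadumulmsb behaviour can be modelled as four independent 8x8-bit multiplies that each keep only the upper byte. The Python function below is a sketch from the description above, with operands assumed to be 32-bit patterns passed as non-negative integers.

```python
def quadumulmsb(rsrc1: int, rsrc2: int) -> int:
    """Four unsigned byte products; the top 8 bits of each 16-bit product are kept."""
    rdest = 0
    for shift in (0, 8, 16, 24):
        product = ((rsrc1 >> shift) & 0xFF) * ((rsrc2 >> shift) & 0xFF)
        rdest |= (product >> 8) << shift   # product <= 0xFE01, so this fits in a byte
    return rdest
```

Keeping the upper byte amounts to multiplying each byte by a fraction n/256, which is one plausible use when scaling pixel intensities.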

ume8ii: ume8ii is an unsigned sum of absolute values of
signed 8-bit differences operation. This operation has the
following function:

if rguard then
    rdest ← abs_val(sign_ext8to32(rsrc1<31:24>) -
                    sign_ext8to32(rsrc2<31:24>)) +
            abs_val(sign_ext8to32(rsrc1<23:16>) -
                    sign_ext8to32(rsrc2<23:16>)) +
            abs_val(sign_ext8to32(rsrc1<15:8>) -
                    sign_ext8to32(rsrc2<15:8>)) +
            abs_val(sign_ext8to32(rsrc1<7:0>) -
                    sign_ext8to32(rsrc2<7:0>))

As shown in Figure 28, the ume8ii operation
computes four separate differences of four respective pairs
of corresponding signed 8-bit bytes of rsrc1 and rsrc2;
absolute values of the four differences are summed, and the
sum is written to rdest. All computations are performed

integer; rsrc2 is considered an unsigned integer and must
have a value between 0 and 0x7fffffff inclusive.

uclipi: uclipi is a clip signed to unsigned operation.
This operation has the following function:

if rguard then
    rdest ← min(max(rsrc1, 0), rsrc2)

The uclipi operation returns a value of rsrc1
clipped into unsigned integer range 0 to rsrc2, inclusive.
The argument rsrc1 is considered a signed integer; rsrc2
is considered an unsigned integer.
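
The min/max formulation of uclipi translates directly into code. In this Python sketch rsrc1 is treated as a signed value, as the "clip signed to unsigned" mnemonic and the max(rsrc1, 0) term suggest, and rsrc2 as a non-negative upper bound; the pixel-range usage is an assumption chosen for illustration.

```python
def uclipi(rsrc1: int, rsrc2: int) -> int:
    """Clip a signed value into the unsigned range 0..rsrc2, inclusive."""
    return min(max(rsrc1, 0), rsrc2)


# Clamping intermediate results to an 8-bit pixel range 0..255.
pixels = [uclipi(value, 255) for value in (-20, 100, 300)]
```

Negative inputs become 0 and inputs above the bound become the bound, so a single operation replaces the two compare-and-branch steps a clamp would otherwise need.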

uclipu: uclipu is a clip unsigned to unsigned operation.
This operation has the following function:

if rguard then {
    if rsrc1 > rsrc2 then
        rdest ← rsrc2
    else
        rdest ← rsrc1
}

The uclipu operation returns a value of rsrc1
clipped into unsigned integer range 0 to rsrc2, inclusive.
The arguments rsrc1 and rsrc2 are considered unsigned
integers.

By use of the above custom multimedia operations,
an application can take advantage of highly parallel

What is claimed is:

1. A computer system comprising: 

input registers receiving input data, each input 
data comprising M bits and having operand data comprising N 
bits of the M bits of input data, where N bits is less than 
M bits; 

a processor for performing a number of operations 
Q on the operand data of the input data in parallel, 
producing result data of N bits, under control of an 
instruction of an instruction set; 

a destination register for storing Q groups of 
result data as one output of M bits. 

2. A computer system comprising: 

input registers for supplying input data of M 
bits, each input data comprising at least two operand data, 
each operand data comprising N bits, where N is less than M; 

a special purpose processor for performing a group 
of operations in parallel for selected sets of operand data 
of the input data, each group of operations producing result 
data of N bits, said processor performing in response to an

4. A computer system as recited in claim 2, wherein:

the input data of M bits comprises two operand 
data of N bits each; 

said processor comprises: 

means for computing an absolute value of 
each operand data, each computation producing a respective 
absolute value of N bits, and 

means for clipping each respective 
absolute value into a specified range to produce respective 
clipped results, each clipped result being N bits; and 

said destination register stores the respective 
clipped results together. 

5. A computer system as recited in claim 2, wherein: 

the input data of M bits comprises a first operand 
comprising N bits and a second operand comprising N bits; 
said processor comprises: 

means for multiplying the first operand 
data of a first input data to the first operand of a second 
input data to produce a first product and multiplying the 
second operand data of the first input data to the second

means for clipping the first difference
and the second difference into a specified range to produce
respective clipped results, each clipped result comprising N
bits; and

said destination register stores the respective 
clipped results together. 

7. A computer system as recited in claim 2, wherein: 

the input data of M bits comprises P operand data 
of N bits each, P being at least two; 

said processor comprises: 

means for adding a respective operand data 
of a first input data to a respective operand data of a 
second input data for each operand data of the P operand 
data, each adding producing a respective sum of N bits, and 

means for clipping each respective sum 
into a specified range to produce P respective clipped 
results, each clipped result being N bits; and 

said destination register stores the P respective 
clipped results.

second input data for each operand data of the P operand
data, each multiplying producing a respective product of N 
bits, and 

means for clipping each respective product 
into a specified range to produce P respective clipped 
results, each clipped result being N bits; and 

said destination register stores the P respective 
clipped results. 

10. A computer system as recited in claim 2, wherein: 

the input data of M bits comprises P operand data
of N bits each, P being at least two;

said processor comprises means for computing a 
respective average of a respective operand data of a first 
input data and a respective operand data of a second input 
data for each operand data of the P operand data, each 
computing producing a respective average of N bits; and 

said destination register stores the P respective
averages.

11. A computer system as recited in claim 2, wherein:

an input register for supplying input data of M

bits; 

a processor for retrieving N bits of data of the 
input data, N being less than M, for P input data, said 
processor comprising means for packing the respective 
retrieved N bits of data of the P input data in a 
destination register. 

14. A computer system as recited in claim 12, wherein:

P is two; 

N is half of M; and 

said processor retrieves one of the most 
significant bits of the input data or the least significant 
bits of the input data. 

15. A computer system as recited in claim 12, wherein: 

a first and a second input data are supplied; 

said processor retrieves the most significant bits 
(msb) of each respective input data, each respective most 
significant bits being supplied as the most significant bits 
(mmsb) of the most significant bits and the least

least significant bits being supplied as the most
significant bits (mlsb) of the least significant bits and
the least significant bits (llsb) of the least significant
bits; 

said means for packing packs the most significant 
bits of the least significant bits (mlsb) of the first input 
data as the most significant bits of a destination register; 

said means for packing packs the most significant 
bits of the least significant bits (mlsb) of the second 
input data as the next most significant bits of the 
destination register; 

said means for packing packs the least significant 
bits of the least significant bits (llsb) of the second 
input data as the least significant bits of the destination 
register; and 

said means for packing packs the least significant
bits of the least significant bits (llsb) of the first input 
data as the next least significant bits of the destination 
register. 

17. A computer system as recited in claim 12, wherein:

result data comprising N bits for each group of operations
performed; and 

a destination register for storing Q result data 
as output data comprising M bits. 

19. A computer system as recited in claim 18, wherein said 
processing is at least one of audio processing and video 
processing. 

20. A computer system as recited in claim 18, wherein said 
computer system is integrated on a semiconductor substrate. 

21. A computer system comprising: 

input registers for supplying input data, the 
input data comprising operand data; 

a processor for performing a number of operations 
on the operand data, the operations including a clipping 
function, said processor producing result data; and 

a destination register for storing selected data 
of the result data. 






means for storing the respective output data together. 

24. A signal processing system for processing signal data, 
said system comprising: 

at least one input register for storing and supplying the 
signal data; and 

a processor for performing, under instruction control, a 
plurality of instructions available in hardware, each
instruction directing said processor to perform at least one 
operation to produce result data, the plurality of 
instructions comprising at least one instruction for 
clipping a result of an operation performed on the signal 
data prior to supplying the result to a destination 
register. 

25. The computer system of Claim 24, wherein said computer
system is integrated on a semiconductor substrate.